## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity in wines is the result of the contribution of four non-volatile organic acids: tartaric, malic, citric and succinic. In the wines under study, the fixed acidity varies in a wide range from 4.60 up to 15.90 g/L (as tartaric acid). The distribution seems to be unimodal, positively skewed with outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Acetic acid accounts for over 90% of the volatile acidity in wines. The threshold for the human perception of vinegary odor in wines is around 0.7 g/L. In the samples under consideration, the volatile acidity ranges from 0.12 to 1.58 g/L, with a median of 0.52 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The concentration of citric acid naturally found in wine is small. As it is a fixed acid, we did not expect a column specifically devoted to it, unless certain amounts of citric acid were intentionally added to the samples at some point in the winemaking process. This assumption is consistent with the high count of samples with no citric acid (132). This variable displays a complex, multimodal distribution, with concentrations ranging between 0.00 and 1.00 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugars consist of small amounts of non-fermented pentose and hexose as well as non-fermentable sugars. They are related to the sweetness of wines. Usually, wines with less than 4 g/L of residual sugar are considered dry, and those with less than 12 g/L are medium dry. By that definition, the vast majority of the wines under consideration are dry, a small percentage are medium dry, and just a few are medium (12 g/L < residual.sugar < 45 g/L).
The distribution of this variable appears unimodal (long positive tail) with a minimum at 0.900 and a maximum at 15.500. The interquartile range is small (IQR = 0.7), with outliers.
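The sweetness classes quoted above can be sketched in base R with `cut()`. This is only an illustration: the function name `sweetness_class` is ours, and because the text does not say which class a value of exactly 4 or 12 g/L belongs to, we assume it goes to the sweeter one.

```r
# Sweetness classes from residual sugar (g/L), using the thresholds quoted
# in the text. Assumption: a value of exactly 4 or 12 g/L is assigned to the
# sweeter class; the text leaves the boundary cases open.
sweetness_class <- function(residual_sugar) {
  cut(residual_sugar,
      breaks = c(0, 4, 12, 45),
      labels = c("dry", "medium dry", "medium"),
      right = FALSE,            # intervals are [0, 4), [4, 12), [12, 45)
      ordered_result = TRUE)
}

# Minimum, median and maximum of residual.sugar from the summary above.
classes <- sweetness_class(c(0.9, 2.2, 15.5))
classes
```

Applied to the whole column, `table(sweetness_class(...))` would give the count of wines in each class.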
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides are related to the salty flavor of wines. Their concentration depends directly on the composition of the soil where the grapes were cultivated. In these samples, chlorides span a narrow concentration interval (IQR = 0.02 g/L). The median (0.079) and the mean (0.08747) differ, possibly because of the high number of outliers.
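To make "outlier" concrete, the short sketch below computes Tukey fences from the quartiles in the summary. The 1.5 x IQR multiplier is the common box-plot convention and is an assumption here, since the text does not state which rule was used; `tukey_fences` is a hypothetical helper name.

```r
# Tukey fences: values beyond them are usually drawn as outliers in a box plot.
# The multiplier k = 1.5 is the common convention, assumed here.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of chlorides taken from the summary above (g/L).
chloride_fences <- tukey_fences(q1 = 0.070, q3 = 0.090)
chloride_fences
```

Under that convention the fences fall at about 0.04 and 0.12 g/L, so both the minimum (0.012) and the maximum (0.611) reported in the summary lie outside them.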
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Sulfur dioxide (SO2) in its free form is a preservative for wine. In this dataset, it changes in a wide interval (1.00 - 72.00 mg/L). It appears positively skewed, with outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The total SO2 content in the wine is the free SO2 (previous figure) plus the bound SO2. The distribution appears unimodal with a long tail. The mean (46.47) is noticeably higher than the median (38.00), typical of positively skewed distributions. Two major outliers are visible.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The density of these wines varies in a narrow range, close to the density of water. It appears to be normally distributed, with the median (0.9968) and the mean (0.9967) almost identical.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Wines are acidic, with pH typically in the 3.2 - 4.0 range. In the case of these wines, the pH appears normally distributed in a narrow interval (IQR = 0.19) with a median of 3.310.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulfates seems unimodal and positively skewed with a few major outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The wine with the least alcohol content had 8.40 % vol. Meanwhile, the highest topped at 14.90 % vol. That range encompasses a low-alcohol and several medium alcohol levels, according to winefolly.com. The distribution of alcohol seems to be positively skewed with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Although quality is defined between zero and ten, the actual quality of the samples varies between 3 and 8. The vast majority of them have a quality of either 5 or 6. The median is 6, and the mean is 5.6.
Variables and Units (Unmodified dataset).
X: whole number, unique for each observation (Removed).
fixed.acidity: non-volatile acids found in wines, the most common are tartaric, malic, citric and succinic (as tartaric acid in g/L).
volatile.acidity: largely acetic acid, linked to the vinegary taste (as acetic acid in g/L).
citric.acid: found in small quantities, typically 1/20 of the tartaric acid concentrations (in g/L).
residual.sugar: the sugar that remains in wine after fermentation completes (in g/L).
chlorides: the amount of salt (as sodium chloride in g/L). There are legal limits for chloride content in wines, but they vary widely from country to country (e.g. 0.606 g/L of sodium chloride in Australia versus 0.06 g/L in Switzerland).
free.sulfur.dioxide: SO2 [molecular] + HSO3 [bisulfites] + SO3 [sulfites]. It is the buffer against microbes and oxidation (in mg/L).
total.sulfur.dioxide: free sulfur dioxide + bound sulfur dioxide [sulfites attached to sugars, acetaldehyde or phenolic compounds] (in mg/L).
density: mass divided by volume; the density of wine is close to that of water (in g/cm3).
pH: -log10 of the activity of the hydrogen ion. It ranges from 0 to 14 (zero is the most acidic, 7 is neutral, and 14 is the most basic); pH values for wines are typically between 3 and 4.
sulphates: the concentration of sulfates. Sulfur dioxide, upon oxidation, is converted into sulfate (as potassium sulfate in g/L).
alcohol: the amount of alcohol in % vol. For wines, it varies over a wide range, from 5.5% (Moscato d’Asti) up to 21% (fortified wines).
quality: a whole number to qualify the wine. It ranges from zero (bad) to ten (excellent), it is a subjective measure. For this dataset, it was the median of a blind testing of at least three different experts for each sample.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The most noticeable variable in this dataset is the quality of the wine. That is a unique feature because it is the only one that does not come directly from an instrument or a mathematical derivation of a direct measurement. So far we depend on human experts to assign a quality number to wine. For that reason, any way to correlate ‘objective’ variables with quality might contribute to our understanding of what makes a wine bad or good.
It appears that volatile acidity and alcohol content are the variables with the strongest correlation to quality. We are going to follow them closely, but also we are going to investigate most of the variables in our dataset.
Immediately after loading the .csv file, we deleted X (we did not find a proper use for it).
After creating the correlation matrix (which requires numerical variables only), we converted quality to a factor variable.
At that point, we also created two more ordered categorical variables from existing variables:
quality.level, with four levels: “Low” for quality equal to 3 or 4, “Medium Low” for quality equal to 5, “Medium” for quality equal to 6 and “High” for quality equal to 7 or 8.
alcohol.level, with three levels: “Low Alcohol” for alcohol content less than or equal to 10 % vol., “Medium Alcohol” for alcohol content higher than 10 % vol. and less than or equal to 11.5 % vol., and “High Alcohol” for alcohol content greater than 11.5 % vol.
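The preparation steps described above could look roughly like the following base R sketch. The small data frame is made up so that the example runs on its own; with the real file, `wine` would come from `read.csv()`. Using `cut()` is our choice for illustration and not necessarily how the variables were originally built.

```r
# Toy stand-in for the wine data; the values are made up for illustration.
wine <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 10.0, 10.5, 11.5, 12.8, 14.0),
  quality = c(3L, 5L, 5L, 6L, 7L, 8L)
)

wine$X <- NULL  # drop the row index

# Ordered quality levels: 3-4 "Low", 5 "Medium Low", 6 "Medium", 7-8 "High".
# With the default right = TRUE the bins are (2,4], (4,5], (5,6], (6,8].
wine$quality.level <- cut(wine$quality,
                          breaks = c(2, 4, 5, 6, 8),
                          labels = c("Low", "Medium Low", "Medium", "High"),
                          ordered_result = TRUE)

# Ordered alcohol levels: <= 10, (10, 11.5], > 11.5 % vol.
wine$alcohol.level <- cut(wine$alcohol,
                          breaks = c(-Inf, 10, 11.5, Inf),
                          labels = c("Low Alcohol", "Medium Alcohol",
                                     "High Alcohol"),
                          ordered_result = TRUE)

# Convert quality to a factor only after the numeric work is done.
wine$quality <- factor(wine$quality)
str(wine)
```

The order matters: the correlation matrix and both `cut()` calls need quality as a number, so the conversion to a factor comes last.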
The histogram for the variable ‘citric.acid’ was unexpected. There are some sharp peaks that could be caused by the addition of certain preferred amounts of citric acid to the samples (including not adding citric acid at all). Other than creating new variables, as mentioned above, we did not modify the existing data in any way.
There are some strong correlations in the data:
pH and fixed acidity (Pearson’s correlation coeff. -0.7). Acid pH values are located in the lower part of the scale (pH < 7). As acidity increases, pH values decrease.
citric acid and fixed acidity (0.7). Citric acid is a fixed acid. We expect it to raise the fixed acidity as more citric acid is added to the wine.
fixed acidity and density (0.7). Fixed organic acids are denser than water. Upon addition to wine, fixed organic acids (such as tartaric, malic, citric, etc.) increase the density of the mix.
total sulfur dioxide and free sulfur dioxide (0.7). Free sulfur dioxide is a subset of total sulfur dioxide.
citric acid and volatile acidity (-0.6). Surprising, since the citric-sugar co-metabolism increases the formation of volatile acid in wine.
pH and citric acid (-0.5). Citric acid is a fixed acid, so the reasoning given for pH and fixed acidity applies here as well.
alcohol and density (-0.5). Alcohol is less dense than water. As the concentration of alcohol increases in wines, their densities go down.
alcohol and quality (0.5). Quality is a subjective index. Of all the variables, alcohol displays the strongest correlation with quality. That indicates a tendency of human taste to associate alcohol level with quality in wines in a positive way.
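A ranking like the one above can be obtained from the correlation matrix. The sketch below is one way to do it in base R; `rank_correlations` is a hypothetical helper, and the input is just the first ten observations printed by `str()`, so the coefficients it returns will not match the full-dataset values quoted in the list.

```r
# Rank variable pairs by the absolute value of Pearson's r.
# `df` must contain only numeric columns.
rank_correlations <- function(df) {
  r <- cor(df, method = "pearson")
  idx <- which(upper.tri(r), arr.ind = TRUE)   # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
  pairs[order(-abs(pairs$r)), ]
}

# First ten observations of four columns, as printed by str().
first_rows <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)
ranked <- rank_correlations(first_rows)
ranked
```

Sorting by the absolute value keeps strong negative relationships, such as pH against fixed acidity, near the top of the list.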
Box plots of citric acid vs. quality show an increase in the median values as quality goes up. The median citric acid concentration for each value of quality goes from being lower than the median of all the samples (dashed line) at low qualities to being clearly higher at qualities 6 and 7. The IQRs appear wide, but there are just a few outliers in the data. For low values of quality (3 and 4) the distribution appears positively skewed; this trend slowly reverses as quality goes up.
As a general trend, density decreases when quality goes from 3 to 8. Quality values 5 and 6 show some outliers. IQRs tend to increase for higher values of quality.
At quality equals 5, there is a maximum for the median of the total SO2 content in the samples. It also displays the highest dispersion in the data. It seems like obtaining middle-quality wines does not imply significant restrictions for the sulfur dioxide content.
The median value of the concentration of sulfates in red wines slowly increases with quality. Quality levels 5 and 6 show the highest number of outliers. The IQRs are modest.
The volatile acidity tends to decrease as quality rises. That makes sense because volatile acidity comes mainly from acetic acid (a fermentation byproduct). The distributions are consistent with a normal-like behavior for qualities 5 and 6. The IQRs tend to decrease with quality.
The median of the alcohol content distributions by quality reaches a minimum at quality = 5, then increases and exceeds 12 % vol. by quality = 8. The dispersion tends to grow at the upper quality values, but at the same time, the median and the mean get closer to each other. Except for quality = 5, there are just a few outliers in the data.
So far, according to our results, we can argue that human taste associates high alcohol content with high-quality wines.
We expect a decrease in the median value of wine density as the alcohol level goes up. The detail that could be a little more intriguing is why the interquartile range of density seems to increase for high alcohol contents.
We observe a consistent decrease in the median values of volatile acidity as the alcohol level goes up. Concerning the acetic acid content, the taste of wine starts becoming unpleasant at concentrations greater than 1.2 g/L. According to the picture above, that represents just a few samples in our dataset. Below that limit, winemakers usually change the values of volatile acidity pursuing the desired flavor.
An increase is observed in the median values of sulfates as alcohol level rises, although it changes in a narrow interval. Low alcohol level samples tend to display more outliers than the rest.
Those four graphs show the strongest correlations among our variables. We are going to dig deeper into them, searching for any hidden dependency on quality. In the case of free sulfur dioxide vs. total sulfur dioxide, we removed the top 0.5 % of the total sulfur dioxide data (two outliers) from the plot.
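Trimming the upper tail of a variable before plotting can be sketched with `quantile()`. The helper name `trim_top` and the toy values are ours; the filter only affects what is plotted and leaves the original data frame untouched.

```r
# Keep rows at or below a given upper quantile of one column (for plotting).
trim_top <- function(df, column, prob = 0.995) {
  x <- df[[column]]
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  df[!is.na(x) & x <= cutoff, , drop = FALSE]
}

# Made-up values with one extreme point.
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 9, 40),
  total.sulfur.dioxide = c(34, 67, 54, 60, 18, 289)
)
trimmed <- trim_top(so2, "total.sulfur.dioxide")
trimmed
```

In a ggplot2 workflow the same effect is often achieved by passing the filtered data frame to the plot, or by limiting the axis range.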
On average, as quality goes from 3 up to 8, the concentrations of citric acid, sulfates and alcohol increase. Volatile acidity, density, pH, and chlorides (not all shown) show the opposite trend; they decrease as quality goes up. The total sulfur dioxide concentration peaks at 5 and then decreases. Residual sugar (not shown) did not change appreciably within the quality range.
Variations in some variables across alcohol levels are, in general, less sharp than in the case of quality. We observed a slight average increase in the concentration of citric acid and sulfates as alcohol levels go up. Meanwhile, density, volatile acidity and chlorides dropped in the same interval of alcohol levels.
As expected, pH values went down as acidity increased, while density and alcohol display opposite trends.
We found a total of four relationships that were almost equally strong. The strongest among them was between pH and fixed acidity (Pearson’s correlation coefficient -0.683, which becomes stronger still if we take log10(fixed.acidity)). That high value of the correlation coefficient is expected, as both magnitudes are related to the chemistry of acid dissociation in the context of acid-base reactions. Fixed acidity also shows strong relationships with citric acid and density, as shown. In addition, total sulfur dioxide and free sulfur dioxide are highly correlated.
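Why the logarithm helps can be illustrated with simulated data. Since pH is defined as -log10 of the hydrogen ion activity, we assume here, purely for illustration, that pH falls linearly with log10 of the acid concentration plus some noise. The numbers below are generated and are not the wine measurements.

```r
set.seed(1)  # simulated data, not the wine measurements

# Fixed acidity drawn over the range seen in the summary (g/L).
fixed.acidity <- runif(500, min = 4.6, max = 15.9)

# Assumed toy model: pH falls linearly with log10(acidity), plus noise.
pH <- 4.2 - 1.0 * log10(fixed.acidity) + rnorm(500, sd = 0.05)

r_raw <- cor(pH, fixed.acidity)          # Pearson's r on the raw scale
r_log <- cor(pH, log10(fixed.acidity))   # Pearson's r after the log transform
c(raw = r_raw, log10 = r_log)
```

Pearson's coefficient measures linear association, so it is stronger on the scale where the relationship is actually a straight line.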
There is a general reduction of the scattering as the quality level rises in the four plots. That could be related to a more restricted range of variation for the given variables as quality improves (e.g. a change of 0.5 pH units at a given fixed acidity might still produce a “Medium Low” quality wine in some regions of the two-variable space, but not a “High” quality one). The data points belonging to different quality levels are mixed in such a way that, with the variables plotted, we were not able to identify “isolated regions” of given qualities.
The same four combinations of variables, now faceted by alcohol level, produced a result similar to the one previously shown for quality levels (the correlation coefficient increases from low to high alcohol levels). Also, a quick look at the quality composition by alcohol level reveals that “Medium Low” quality samples dominate at low alcohol, “Medium” quality does it at “Medium Alcohol” and the “High Alcohol” level is almost exclusively populated by medium and high-quality samples. All of this adds support to the connection between quality and alcohol.
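The quality composition by alcohol level mentioned above is a two-way table of counts turned into row shares. A minimal sketch with made-up labels, assuming the two factors are named as in this report:

```r
# Made-up labels, only to show the shape of the calculation.
alcohol.level <- factor(
  c("Low Alcohol", "Low Alcohol", "Low Alcohol", "Medium Alcohol",
    "Medium Alcohol", "High Alcohol", "High Alcohol", "High Alcohol"),
  levels = c("Low Alcohol", "Medium Alcohol", "High Alcohol"),
  ordered = TRUE)
quality.level <- factor(
  c("Low", "Medium Low", "Medium Low", "Medium Low",
    "Medium", "Medium", "High", "High"),
  levels = c("Low", "Medium Low", "Medium", "High"),
  ordered = TRUE)

counts <- table(alcohol.level, quality.level)
shares <- prop.table(counts, margin = 1)   # each alcohol level sums to 1
round(shares, 2)
```

Using `margin = 1` normalizes within each alcohol level, which is what allows statements such as "Medium Low dominates at low alcohol" to be read directly from a row.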
The volatile acidity vs. alcohol plot, colored by quality level (upper graph), confirms what we already know: on average, high-quality wines tend to have high levels of alcohol and low volatile acidities. There is no clear pattern that would allow us to identify “zones” mostly populated by single-quality samples. That is probably because Medium-Low and Medium-quality samples are spread over the entire range of the variables. However, the lines associated with the linear (Pearson’s) correlation show a definite trend in volatile acidity.
That slightly overplotted graph is maybe a good example of the flexibility in the range of variables that allowed Portuguese winemakers to obtain high percentages of at least average quality red wines. Is there any room for improvements?
The graph at the bottom shows the same variables plotted in the upper graph, but in this case, the extreme quality levels samples are highlighted (high quality in red, low quality in blue). The median values for volatile acidity and alcohol (dashed lines) divide the graphs into 4 quadrants; we named them as in the case of trigonometric functions: starting at “I” for the upper rightmost quadrant and following the labeling process in a counterclockwise manner up to “IV” for the bottom-right one.
It appears that low-quality samples are mostly located in quadrant II. Meanwhile, high quality points overwhelmingly populate quadrant IV.
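The quadrant labeling can be sketched as a small helper. We assume alcohol is on the horizontal axis and volatile acidity on the vertical one, and that a point lying exactly on a median line counts as being on the "high" side; neither detail is stated in the text. `median_quadrant` is a hypothetical name.

```r
# Label each point with a quadrant (I-IV, counterclockwise from the upper
# right) relative to the medians of x and y. Assumption: points lying exactly
# on a median line are counted on the "high" side.
median_quadrant <- function(x, y) {
  right <- x >= median(x)
  upper <- y >= median(y)
  ifelse(right & upper, "I",
         ifelse(!right & upper, "II",
                ifelse(!right & !upper, "III", "IV")))
}

# Made-up points: x is alcohol (% vol), y is volatile acidity (g/L).
alcohol          <- c(9.0, 9.5, 10.0, 11.0, 12.0, 13.0)
volatile.acidity <- c(0.90, 0.30, 0.80, 0.70, 0.25, 0.35)
quadrants <- median_quadrant(alcohol, volatile.acidity)
quadrants
```

With these axes, quadrant II collects low-alcohol, high-volatile-acidity samples and quadrant IV the high-alcohol, low-volatile-acidity ones; `table(quadrants, quality.level)` would then count how each quality level is distributed across the quadrants.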
In the plots of sulfates vs. alcohol, we observe a pattern (upper graph) similar to that of volatile acidity vs. alcohol: samples of average quality spread all over the range, but with a clear trend in the linear correlation plots. When we highlighted the extreme cases (bottom graph), we observed that high-quality samples have, on average, a higher sulfate content (upper half) than low-quality samples (bottom half), with little overlap.
The sulphates vs. volatile acidity graph behaves as expected. There is a negative correlation between the two variables. Quadrant II is mostly populated by high-quality samples and quadrant IV is dominated by low-quality wines.
When we split the four strongest correlations across quality levels, we observed that they tend to become stronger as we move in quality from low to high. The same applies if we substitute alcohol levels for quality levels. However, using this kind of graph we were not able to identify any “zones” of quality segregation.
We decided to make a more detailed analysis using variables that correlate better with quality and narrowed our exploration down to the three strongest: alcohol, volatile acidity and sulfates.
After making three different scatter plots of these variables, we observed some conditions linked to the existence of high-quality samples:
Low volatile acidity values and high alcohol content.
High sulfates and high alcohol concentrations.
High concentration of sulfates and low volatile acidity.
Where “Low” and “High” mean below or above the respective median value of the variable.
The two conditions involving sulfates tend to produce more definite boundaries.
When highlighted, the extreme quality levels (“Low” and “High”) could be resolved in some two-variable spaces (at least in those related to the three variables with strongest correlation with quality).
We resisted the temptation of creating a new variable named “total acidity” by adding up “fixed.acidity”, “volatile.acidity”, and “citric.acid”. A closer look at the variables makes clear that they are expressed as different acids. Unfortunately, even by doing that correctly with the intention of creating a model to predict the pH values based on total acidity (and using a simple equilibrium constant model), our predicted pH values were off by 0.3 to 0.9 pH units. For that reason, we decided to exclude it from this work.
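Putting the acidities on a common basis means converting through equivalent weights. The sketch below expresses volatile acidity (reported as acetic acid) in grams of tartaric acid per litre before adding it to fixed acidity. It is a sketch under our own assumptions and may not be the conversion used in the discarded model: citric.acid is left out because, being a fixed acid, it is assumed to be already counted inside fixed.acidity. The function name is hypothetical.

```r
# Equivalent weights in g per mole of acidic protons (molar mass / protons).
eq_tartaric <- 150.09 / 2   # tartaric acid, diprotic
eq_acetic   <- 60.05 / 1    # acetic acid, monoprotic

# Convert volatile acidity (g/L as acetic acid) to g/L as tartaric acid and
# add it to fixed acidity. Assumption: citric.acid is already part of
# fixed.acidity, so it is not added again.
total_acidity_as_tartaric <- function(fixed_acidity, volatile_acidity) {
  fixed_acidity + volatile_acidity * (eq_tartaric / eq_acetic)
}

# First observation printed by str(): 7.4 g/L fixed, 0.70 g/L volatile.
total_first <- total_acidity_as_tartaric(7.4, 0.70)
total_first
```

The conversion factor is close to 1.25, so for the first sample the total comes out near 8.27 g/L as tartaric acid, compared with 8.1 g/L from a naive sum of the two columns.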
The bar plot displays the composition of our dataset regarding quality and alcohol levels. It shows not only that qualities 5 and 6 are the most populated, but also that the number of low alcohol samples peaks at quality 5, medium alcohol peaks at 6 and high alcohol peaks at 7. The relative composition of alcohol levels within a single quality value changes. At low values of quality (less than or equal to 5) samples with low alcohol dominate, at quality 6 “Medium Alcohol” samples are the most common, and finally the composition shifts towards “High Alcohol” at 7.
We chose the “quality.level” variable in such a way that it is consistent with this graph. We merged samples from qualities 3 and 4 to create the level “Low,” and those from quality 7 and 8 to create “High.” We decided to leave 5 and 6 as independent levels (“Medium Low” and “Medium,” respectively) mainly because they have different alcohol level compositions.
According to this graph high quality is associated with high concentration of alcohol in wines.
Although alcohol seems to be relevant to wine quality, other variables also change across the whole quality range. In the figure above we chose four of them: citric acid, volatile acidity, sulfates and density. They all have monotonic trends with quality: citric acid and sulfates increase, while volatile acidity and density decrease as quality goes from 3 to 8. Citric acid is a weak organic acid; its addition helps by chelating metal ions to prevent browning, increasing the acidity in the process. Too much citric acid affects the taste of red wines negatively and causes microbial instabilities, since bacteria use citric acid in their metabolism.
Volatile acidity is mostly due to the presence of acetic acid in wines. It is responsible for the vinegary odor and taste. For that reason, it is not a surprise that volatile acidity decreases as quality increases.
Wine is water, for the most part. That explains why its density is close to 1 g/cm3. The presence of alcohol makes wine density lower. As the alcohol % vol. increases with quality, we expect a decrease in wine density as quality rises. In a simple water-ethanol solution, a 10 % vol. alcohol concentration (which is typical for wines) would lead to a density of roughly 0.985 g/cm3. What we observe in our data are higher values of density, so there are other components of wine also playing a role. The final result is a narrow density interval.
One of the known sulfates (or sulphates in British English) used in wines is copper sulfate. It is a fining agent, used to remove unpleasant aromas in wine, particularly those related to hydrogen sulfide (rotten-egg-like off aromas). That explains, at least in part, the positive correlation between sulfates and quality.
After plotting some data, it seems that samples with qualities 5 and 6 are dispersed throughout the entire range of the variables. That makes it nearly impossible to find clean areas in two-variable spaces (not involving a quality variable) where clear boundaries between quality levels exist. It appears that, after centuries of wine mass production, winemakers in the Minho province (Portugal) have found a safe range of values for the variables under consideration that allows them to produce a significant percentage of medium-quality (5-6) red wines. We decided then to explore what happens to just the extreme levels (“Low” and “High” quality). In the figure above we highlighted data points from quality levels “Low” and “High”. We plotted sulphates vs. volatile acidity (top) and then sulphates vs. alcohol (bottom left) and volatile acidity vs. alcohol (bottom right).
We observed that three pairs of conditions are associated with high-quality samples. In the case of sulphates vs. volatile acidity, most of the high-quality samples fall in the region of high sulfate concentrations and low volatile acidity values. For sulphates vs. alcohol, the region of high-quality samples is high alcohol and high sulfates. Finally, in the case of volatile acidity vs. alcohol there is more dispersion in the data, but there is a tendency for the high-quality samples to be at high alcohol and low volatile acidity. In this context, “high” and “low” mean above or below the corresponding median value (dashed lines).
The conditions involving sulfates produced sharper results when it comes to isolating a region with no low-quality samples, or with just a few of them.
Assigning a quality number to wine relies entirely on human taste, with all its subjectivity. Throughout this study, we tried to identify variables and conditions that could help us rationalize what otherwise remains more an art than a science.
We worked with a dataset that has been around for a while now; regardless, we struggled at first to make sense of all the variables involved. As an example, we had a hard time figuring out why there is a dedicated column for citric acid (a fixed acid) and no columns for tartaric or malic acids, given that both tartaric and malic acids appear in higher concentrations.
From the beginning, we tried to find intervals of change in variables and connect them to values of quality. After dozens of plots there was no clear pattern. At some point, we realized that maybe we were dealing with standard procedures optimized over the years (centuries even) to maximize the amount of average-quality wines. We then turned to the best and the worst quality levels. If there is an opportunity to improve the overall wine quality, it probably relies on tweaking some variables away from the conditions where low-quality wines exist.
We found that, on average, alcohol and sulfates are positively associated with wine quality, while volatile acidity is negatively associated with it. We also noticed, after some searching, that there are at least two configurations of those variables that could potentially improve the overall quality of the batches by minimizing the proportion of low-quality wines.
In the future, it could be interesting to continue the exploration in that direction. We acknowledge that the small amount of data could be a serious drawback.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, 47(4), 2009, 547-553, ISSN: 0167-9236.